
[feat] Allow middle checkpoint evaluation in background using lmms-eval http server#127

Merged
kcz358 merged 14 commits into main from dev/eval_checkpoint
Jan 12, 2026

Conversation

@kcz358
Collaborator

@kcz358 kcz358 commented Jan 9, 2026

Motivation

Modifications

Commit Message Convention

Please follow our standardized commit message format:

  • [feat] - New features or functionality
  • [fix] - Bug fixes
  • [docs] - Documentation changes only
  • [style] - Code style changes (formatting, missing semicolons, etc.)
  • [refactor] - Code refactoring without changing functionality
  • [perf] - Performance improvements
  • [test] - Adding or updating tests
  • [chore] - Maintenance tasks, dependency updates, etc.
  • [ci] - CI/CD configuration changes

Examples:

  • [feat] add qwen omni iterable dataset support
  • [fix] resolve bagel model configuration error
  • [docs] update training guide with YAML examples

See CONTRIBUTING.md for more details.

CI/CD Checks

Your PR will automatically run the following checks:

  • Linting: Code formatting with black (line-length=120) and import sorting with isort
  • Run pre-commit run --all-files locally to verify before pushing

Checklist

  • Follow commit message convention (see above)
  • Run pre-commit run --all-files and ensure all checks pass
  • Format your code with black (line-length=120) and isort
  • Add unit tests for new functionality
  • Update documentation as needed, including docstrings or example tutorials
  • Ensure all CI/CD checks pass

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cdb5738e4d


Comment on lines +268 to +271
logger.info("Waiting for pending evaluation jobs to complete...")
while len(self.eval_backend.pending_evals) > 0:
for eval_step, metrics in self.eval_backend.check_and_get_completed():
if rank == 0:


P2: Drain last eval results after pending jobs finish

When wait_until_complete=True, the loop condition only checks pending_evals. If the final job completes between the last check_and_get_completed() call and the next while condition check, the worker thread will move the job into results_queue and remove it from pending_evals, causing the loop to exit without ever logging that last result. This drops the final metrics at the end of training; you can fix this by draining the queue once more after the loop, or by looping until both pending_evals and the results queue are empty.
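A minimal sketch of the suggested fix, continuing the snippet above (log_eval_metrics is a hypothetical stand-in for whatever the trainer actually does with the results):

logger.info("Waiting for pending evaluation jobs to complete...")
while len(self.eval_backend.pending_evals) > 0:
    for eval_step, metrics in self.eval_backend.check_and_get_completed():
        if rank == 0:
            log_eval_metrics(eval_step, metrics)  # hypothetical logging helper
# Drain the results once more: a job that finishes between the last poll and
# the loop condition check would otherwise be dropped without being logged.
for eval_step, metrics in self.eval_backend.check_and_get_completed():
    if rank == 0:
        log_eval_metrics(eval_step, metrics)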


Comment on lines +97 to +99
for model_state_shard in shard_state_dicts:
tensor = model_state_shard.pop(key)
state_dict[key].append(tensor._local_tensor.bfloat16())


P2: Preserve original dtype when consolidating shards

The merger unconditionally converts each shard tensor to bfloat16 before concatenation. That will silently downcast checkpoints trained in fp32 or fp16, which can degrade accuracy or break downstream assumptions about dtype. Since this is a merge utility, it should preserve the original dtype from the shards rather than forcing bfloat16.
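A minimal sketch of the dtype-preserving variant, reusing the names from the snippet above:

for model_state_shard in shard_state_dicts:
    tensor = model_state_shard.pop(key)
    # Keep the shard's original dtype (fp32/fp16/bf16) rather than forcing
    # .bfloat16(); any downcast can then be an explicit, opt-in step.
    state_dict[key].append(tensor._local_tensor)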


@kcz358
Collaborator Author

kcz358 commented Jan 9, 2026

First, install lmms-eval to use the eval client. A full installation is not needed since only the client side is used:

cd /path/to/lmms-eval
uv pip install --no-deps . (or -e . for editable)
uv pip install fastapi uvicorn

Add the eval config to the training YAML:

  eval_config:
    server_url: "http://192.168.8.249:8000"
    poll_interval: 10.0
    checkpoint_key: "model"
    checkpoint_type: "regular"
    num_gpus: 8
    batch_size: 256

The eval results will be logged to wandb. A few things to note:

  1. Since the training and evaluation sides are disaggregated, a checkpoint must be saved before it can be evaluated, so the save steps need to be equal to the eval steps (see the sketch after this list).
  2. You might need to keep the intermediate checkpoints (set a higher save total limit) so that they are not cleaned up before they have been evaluated.
  3. For now, the eval server must share a file storage system with the training system.
  4. The evaluation runs quietly in the background.
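To make notes 1 and 2 concrete, a sketch of the related trainer settings; save_steps, eval_steps, and save_total_limit are common trainer-style names for the options the notes refer to and may not match this repo's exact config keys:

  # Hypothetical key names for illustration only; check the repo's actual config schema.
  save_steps: 1000        # a checkpoint must exist before each eval (note 1)
  eval_steps: 1000        # keep equal to save_steps
  save_total_limit: 20    # high enough that checkpoints are not cleaned up before eval (note 2)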

@kcz358 kcz358 merged commit 3349d7b into main Jan 12, 2026
2 checks passed
@kcz358 kcz358 deleted the dev/eval_checkpoint branch January 12, 2026 02:01
kcz358 added a commit that referenced this pull request Jan 12, 2026
[feat] Allow middle checkpoint evaluation in background using lmms-eval http server (#127)

* rfc ema utils so that the attribute is being retrieved after the first init

* [feat] Add FSDP2 checkpoint merger module

Add utilities for merging sharded FSDP2 checkpoints into single consolidated checkpoints for evaluation and inference. Includes base class and FSDP2 implementation with support for both regular and EMA checkpoints.

* [feat] Add eval server backend for asynchronous checkpoint evaluation

* [feat] Integrate eval server backend into FSDP2 trainer

* [feat] Add eval optional dependency with httpx

* [feat] Add lmms_engine_kwargs support for checkpoint merging

* [feat] Pass checkpoint_type to eval backend in validation_step

* [feat] Update version and config for eval/EMA features

* [fix] Fix EvalClient import and add eval_output_dir parameter

* [refactor] Remove output_dir and check_interval from EvalConfig

* [feat] Add eval_strategy check and wait for eval completion

* [feat] Define global_step as step_metric for eval metrics in wandb

* [feat] Use global_step in metrics for eval results logging

* [docs] Add async eval guide and update merge FSDP documentation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
